Abstract

The  performance  of  cache-coherent  multiprocessors  is  strongly  influenced  by  locality  in 
the  memory  reference  behavior  of  parallel  applications.  While  the  notions  of  temporal  and 
spatial  locality  in  uniprocessor  memory  references  are  well  understood,  the  corresponding 
notions  of  locality  in  multiprocessors  and  their  impact  on  multiprocessor  cache  behavior  are 
not  clear.  A  locality  model  suitable  for  multiprocessor  cache  evaluation  is  derived  by  viewing 
memory  references  as  streams  of  processor  identifiers  directed  at  specific  cache/memory 
blocks.  This  viewpoint  differs  from  the  traditional  uniprocessor  approach  that  uses  streams 
of  addresses  to  different  blocks  emanating  from  specific  processors.  Our  view  is  based  on  the 
intuition  that  cache  coherence  traffic  in  multiprocessors  is  largely  determined  by  the  number 
of  processors  accessing  a  location,  the  frequency  with  which  they  access  the  location,  and  the 
sequence  in  which  their  accesses  occur.  The  specific  locations  accessed  by  each  processor, 
the time order of access to different locations, and the size of the working set play a smaller
role in determining the cache coherence traffic, although they still influence intrinsic cache
performance. Looking at traces from the viewpoint of a memory block leads to a new notion
of  reference  locality  for  multiprocessors,  called  processor  locality.  In  this  paper,  we  study  the 
temporal,  spatial,  and  processor  locality  in  the  memory  reference  patterns  of  three  parallel 
applications.  Based  on  the  observed  locality,  we  then  reflect  on  the  expected  cache  behavior 
of  the  three  applications. 


1  Introduction 

Multiprocessors  often  use  caches  to  reduce  their  network  bandwidth  requirements.  Caches  retain 
recently accessed data so that repeat references to this data in the near future will not
require network traversals. Repeated access to the same data in a given interval of time is the
property of temporal locality of memory references and has been well studied in single processor
systems [1, 2]. Spatial locality is a related property of memory references that places a high
probability of access on data close to previously accessed data. Again, this property of single
processor programs has been widely observed. The viability of cache-coherent multiprocessors
is strongly predicated on whether the multiprocessor caches can exploit locality of memory
referencing.


* Preliminary results of this study were reported in [...].


Clearly, a thorough understanding of the memory access patterns of parallel processing
applications is necessary to determine a suitable organization of the memory hierarchy in
multiprocessors. For example, several cache consistency algorithms proposed in the literature are
based on subtle differences in the expected memory reference patterns; lacking a characterization
of multiprocessor memory referencing locality, it is hard to obtain insight into the benefits of
one scheme over another. While some previous studies have looked at shared-memory reference
patterns (e.g., [3]), they did not analyze the temporal, spatial, and processor locality of shared
data.

Unfortunately, multiprocessor locality models that we can use to aid in our understanding
of the reference patterns of parallel systems do not exist. The well known notions of locality
in single processor systems do not carry over straightforwardly. Consider, for example, the
sequence of memory references r1 r2 r3 r4 r5 to the same memory block. While such temporal
locality can be usefully exploited by a uniprocessor cache, the degree to which a multiprocessor
uses such locality depends on which processor made the individual references and whether the
references were reads or writes. The negative extreme case would correspond to each reference
being a write and emanating from a different processor.

Similarly, block size effects are hard to estimate. Increasing the block size could improve
useful locality by capturing additional data words in the block that will be referenced by the
processor in the near future. However, two data words being written by different processors could
fall into the same block owing to a block size increase and prove harmful to cache performance.

We present a simple characterization of multiprocessor memory references and derive a
locality model that is useful in a multiprocessor context. The key to the model is that we focus
on the set of references by one or more processors to a given memory block. We introduce the
notion of processor locality as the average number of repeat references to a memory block by
the same processor. Specific variations of processor locality can be defined for use in different
applications. For example, one interesting form of processor locality that provides insight
into ownership-based cache coherence schemes is the sequences of repeat references to a given
memory block by the same processor, given that at least one of the references is a write [4].
A slightly different definition might count just the number of writes to a block by the same
processor before a reference by another processor. Eggers and Katz [5] proposed using such a
metric in characterizing multiprocessor memory references.

Besides its obvious use in gaining insight into the performance of cache coherence schemes,
processor locality metrics can also be used to evaluate the efficacy of block structuring algorithms
proposed to enhance locality in memory referencing of shared memory multiprocessors.

We use our locality characterization to analyze the locality patterns in three parallel
applications using address trace data. Multiprocessor address traces are derived from these parallel
applications running under the MACH operating system on a shared-memory multiprocessor. An
extended ATUM address tracing scheme implemented on a 4-CPU DEC VAX 8350 [6] provided
the trace data used in this study. The applications include ParaOPS5 (a parallel implementation
of the OPS5 rule-based language), P-Thor (a parallel logic simulator), and LocusRoute (a
global router for VLSI standard cells).*

Our results suggest that shared references display a significant amount of temporal locality
and only a moderate amount of processor locality. The average number of read and write



references to a write-shared block before a reference by another processor are 4 and 2
respectively. This locality is exploited by the write-back class of cache coherence schemes to
reduce the cost of references to shared data.

This paper is organized as follows. Section 2 defines our multiprocessor model and the
terminology used throughout the paper. Section 3 presents background information about the ATUM
address tracing technique and the applications measured. Section 4 constitutes the bulk of the
paper and is devoted to analyzing locality in the parallel traces, and studying the impact of the
reference characteristics on cache consistency algorithms. Specifically, Section 4.1 assesses the
temporal locality in shared references, Section 4.2 the processor locality, and Section 4.3 analyzes
spatial locality in the traces. Section 4.4 focuses on how the memory reference characteristics
affect the performance of various cache consistency algorithms. Section 5 concludes the paper.


2  Characterization  of  Memory  References


This section presents the multiprocessor model and introduces some nomenclature to help
explain memory access patterns in multiprocessors. The notion of processor locality is also
introduced.

2.1  Multiprocessor  Model  and  Definitions 

The  multiprocessor  model  we  assume  for  our  analyses  is  straightforward.  We  assume  that  the 
system  consists  of  several  processors  each  with  its  own  cache  memory.  Memory  is  accessed 
through an interconnection network. We make the simplifying assumption that caches are
infinite in size to concentrate on traffic caused by cache coherence related actions. The
specific organization of the network and memory system is, however, unimportant to our
characterization of locality.

We first introduce some nomenclature to help explain memory access patterns. A block is
the unit of data transfer between the cache and main memory. The block size is assumed to be
1 word (4 bytes) unless otherwise stated. The small block size is chosen so that the reference
behavior for each data object can be derived. However, characterization using larger block sizes
is also important to study the spatial locality of shared objects, and is dealt with in Sections 4.3
and 4.4. A read-shared block is one that is shared (accessed by multiple processors), but never
written into for the duration of the trace. A write-shared block is one that is shared, and written
at least once. A cpu-shared block is one that is either read shared or write shared.
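
The sharing classes defined above can be illustrated with a short sketch. The trace format
(tuples of processor, block, and operation) and the label "private" for blocks touched by a
single processor are assumptions made for this illustration; they are not taken from the
tracing scheme described in this paper.

```python
from collections import defaultdict


def classify_blocks(trace):
    """Classify each block seen in a trace.

    trace: iterable of (cpu, block, op) tuples, op being 'r' or 'w'.
    Returns {block: 'private' | 'read-shared' | 'write-shared'}.
    A cpu-shared block is any block that is read-shared or write-shared.
    """
    cpus = defaultdict(set)   # block -> set of processors that referenced it
    written = set()           # blocks written at least once during the trace
    for cpu, block, op in trace:
        cpus[block].add(cpu)
        if op == "w":
            written.add(block)

    kinds = {}
    for block, who in cpus.items():
        if len(who) < 2:
            kinds[block] = "private"        # hypothetical label, not from the paper
        elif block in written:
            kinds[block] = "write-shared"
        else:
            kinds[block] = "read-shared"
    return kinds


example = [(0, "A", "r"), (1, "A", "r"), (0, "B", "w"), (1, "B", "r"), (2, "C", "w")]
print(classify_blocks(example))
```

Note that a block written by only one processor but read by another still counts as
write-shared under this definition, because the classification covers the whole trace.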

It is useful to have a notion of time in the context of multiprocessor execution. Our traces
contain interleaved memory accesses by the various processors in approximately the same order
they occurred. However, the exact time at which the reference was made is not clear. For
example, if the processors i, j, and k each made references at real time instants t, t+1, and so
on, the trace might have the references i_t, j_t, k_t, j_{t+1}, i_{t+1}, k_{t+1}, where the order of the
references of the 3 processors might be random with respect to each other. The traces also show
clusters of memory references by the same processor, and the time interval between references
by the same processor also varies.

Owing to such statistical variations in the reference pattern, we will use an approximation
to real time. The order of occurrence of a reference in the trace is our index of time. So the
nth reference in the trace is considered to have occurred at time n. Because the paper considers
several cases where the traces are filtered to extract specific references (e.g., shared user data),
to enable comparisons, the time index used for a reference depends on its index in the original
trace. For example, when we filter out operating system references while studying sharing in the
user address space, the time index of a user reference corresponds to its position in the unfiltered
trace.
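
A minimal sketch of this time-indexing rule follows. The record layout, including the
'user'/'os' tag used to filter references, is hypothetical and chosen only for illustration.

```python
def filter_with_time(trace, keep):
    """Return (time, ref) pairs for the references accepted by `keep`.

    The time of a reference is its 1-based position in the UNFILTERED trace,
    so intervals measured on filtered traces stay comparable.
    """
    return [(n, ref) for n, ref in enumerate(trace, start=1) if keep(ref)]


# Hypothetical records: (cpu, block, op, space)
trace = [
    (0, "A", "r", "user"),
    (1, "X", "w", "os"),
    (1, "A", "r", "user"),
]
user_refs = filter_with_time(trace, lambda ref: ref[3] == "user")
print(user_refs)
```

The two user references keep times 1 and 3, so the interval between them is 2 time units
even though they are adjacent in the filtered trace.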

The ensuing definitions for displaying multiprocessor locality focus on the sequence of
processors referencing a given memory block. Contrast this viewpoint with uniprocessor locality
that typically focuses on the sequence of memory addresses referenced by a given processor. A
reference to a block B by processor i is said to ping if the previous reference to that block was
by processor j, where j ≠ i. We call such a reference a pinging reference. Conversely, a reference
to a block B by processor i is said to cling if the previous reference to that block was also by
processor i. Such a reference is called a clinging reference. By these definitions, a ping can only
occur on a reference to a shared block. Pings and clings to a block are determined simply by
keeping track of which processor last referenced a block. Similarly, the current state of a block,
clean or dirty, is determined solely by the references of the processor accessing it currently. A
block is said to be dirty if it has been written into since the previous pinging reference to it.
Therefore, a block always starts out clean following a pinging reference to it.
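
The bookkeeping described above amounts to remembering, per block, the last referencing
processor and a dirty flag. The sketch below is one possible reading of the definitions. Two
points are assumptions of this sketch rather than statements from the text: the very first
reference to a block is labelled 'first' (it has no previous reference to compare with), and a
pinging write is taken to leave the block dirty.

```python
def annotate(trace):
    """Label every reference in a trace.

    trace: iterable of (cpu, block, op) tuples, op being 'r' or 'w'.
    Returns a list of (cpu, block, op, kind, dirty_before), where kind is
    'first', 'ping', or 'cling', and dirty_before tells whether the block
    was dirty when the reference arrived.
    """
    last_cpu = {}   # block -> processor that made the previous reference
    dirty = {}      # block -> written since the previous pinging reference?
    out = []
    for cpu, block, op in trace:
        prev = last_cpu.get(block)
        was_dirty = dirty.get(block, False)
        if prev is None:
            kind = "first"          # assumption: neither ping nor cling
        elif prev == cpu:
            kind = "cling"
        else:
            kind = "ping"
        if kind != "cling":
            dirty[block] = False    # block starts out clean after a ping
        if op == "w":
            dirty[block] = True     # assumption: a pinging write dirties the block
        last_cpu[block] = cpu
        out.append((cpu, block, op, kind, was_dirty))
    return out


refs = [(1, "B", "r"), (1, "B", "w"), (3, "B", "r"), (3, "B", "r"), (2, "B", "w")]
for row in annotate(refs):
    print(row)
```

A "ping to a dirty block" is then any row whose kind is 'ping' and whose dirty_before flag is
true; in the example only the read by processor 3 qualifies.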

Figure 1 depicts read/write references to a given memory block, where the number in the
second row corresponds to the processor accessing the block. The reference by processor 3 at
time t+[illegible] is a pinging read reference; the reference at time t+25 is a clinging write reference.


2.2  Characterizing  Locality 

The notion of clings and pings allows the derivation of simple criteria for multiprocessor memory
reference locality. The appealing feature of clings and pings is that they do not depend on
implementation details such as cache sizes. In addition, they provide useful information about cache
performance. For example, assuming a local cache, clinging read references do not cause a
network transaction; on the other hand, pinging write references always cause a network transaction.
The ensuing discussion uses statistics derived from pings and clings to study locality.

Temporal locality is displayed by references to a given block of data that are clustered in
time. Small time intervals between clinging references denote a useful form of temporal locality
in multiprocessors; conversely, small time intervals between pinging references are potentially
harmful. In the reference sequence depicted in Figure 1, temporal locality of clinging references
is more evident.

Time intervals between pinging and clinging references are a useful method of depicting the
temporal locality of shared-memory references and can yield useful insights into the behavior
of small caches in multiprocessor environments. However, a block might reside in a large cache
for long periods of time without being displaced, making the relative sequence of references to
a given block by various processors a more important determinant of cache performance. The
form of locality that becomes more important, then, is called processor locality.

Processor locality is the tendency of a processor to access a block repeatedly before an access
from another processor, and is measured by the average length of the sequences of clinging
references. In Figure 1, the average number of clinging references before a pinging reference is
(1+3+4)/3.

Figure 1: Characterizing locality in multiprocessor memory references. Various processor accesses
(represented by the numbers in the second row) of a given block B are shown. r/w stand for reads
or writes. The time instants with no corresponding references imply accesses of blocks other than B.

Note that fine time distinctions are not significant in our study. To approximate real time, one can
keep a virtual system time incremented by one unit for every n references in the trace, where n is
the number of processors. In other words, the time reported in this paper can be divided by n to
get a rough idea of the real time.

We can derive a class of processor locality metrics for use in different applications. For
example, a characterization that does not distinguish between read and write references is enough to
analyze cache coherence schemes such as the Dir1NB directory scheme studied in [8]. However,
this  definition  is  unsuited  for  a  cache  coherence  scheme  that  allows  multiple  cached  copies  of 
clean  blocks.  Therefore,  a  more  practical  definition  of  processor  locality  measures  the  average 
length  of  those  sequences  of  clinging  references,  where  at  least  one  reference  is  a  write.  This 
definition  yields  (3  +  4)/2  as  the  measure  of  processor  locality  for  our  example. 
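
Both measures can be computed from the per-block stream of processor identifiers. The sketch
below reproduces the two values quoted for the example, (1+3+4)/3 and (3+4)/2, using a made-up
reference stream whose runs have lengths 1, 3, and 4; it is not the actual sequence of Figure 1.
Counting the first reference of each run as part of the run, and counting the final run even
though no ping terminates it, are assumptions of this sketch.

```python
from itertools import groupby


def run_lengths(refs, require_write=False):
    """Lengths of maximal same-processor runs in the references to ONE block.

    refs: list of (cpu, op) in trace order, op being 'r' or 'w'.
    require_write: keep only runs that contain at least one write.
    """
    runs = []
    for _, group in groupby(refs, key=lambda ref: ref[0]):
        ops = [op for _, op in group]
        if require_write and "w" not in ops:
            continue
        runs.append(len(ops))
    return runs


def processor_locality(refs, require_write=False):
    """Average run length; 0.0 if no run qualifies."""
    runs = run_lengths(refs, require_write)
    return sum(runs) / len(runs) if runs else 0.0


# Made-up stream for one block: runs of length 1 (read only), 3, and 4.
refs = [
    (1, "r"),
    (3, "r"), (3, "w"), (3, "r"),
    (2, "w"), (2, "r"), (2, "r"), (2, "w"),
]
print(processor_locality(refs))                      # (1 + 3 + 4) / 3
print(processor_locality(refs, require_write=True))  # (3 + 4) / 2
```

Dropping the read-only run raises the measure, which matches the intuition that sequences
containing a write are the ones an ownership-based scheme can exploit.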

In general, we can use the following notation to describe a processor locality metric:
r^(+|*) w^(+|*) t_p. Here, r and w denote reads and writes to a block by a given processor, +
denotes one or more, and * denotes zero or more. Sequences by the same processor are
terminated by a pinging reference of type t. The type of the pinging reference can be a read, write,
or either (denoted r, w, r/w). The length of the sequence determines the processor
locality. In this notation, the two definitions of processor locality in the previous paragraph are
r* w* r/w_p and r* w+ r/w_p respectively.

Processor  locality  measures  locality  in  shared  references  alone.  It  is  meant  as  an  aid  to  gain 
insight into the shared reference patterns of parallel programs and usually cannot be used to
obtain performance data directly. For instance, an application that has very few shared references
will have a low rate of cache coherency related transactions even with abysmal processor
locality. Consequently, a performance model might consider using the fraction of shared references
in addition to the processor locality parameter.

A direct impact of processor locality is noticed in the performance of various cache
consistency schemes, which exploit different locality patterns in references to read-shared or
write-shared blocks. Notice that a high temporal locality of pinging references yields a low processor
locality, and negatively impacts the performance of multiprocessor caches.

Spatial locality is the tendency of processors to access data in the vicinity of a recently
accessed memory word in a given interval of time. Clearly, a useful form of spatial locality
increases the probability that a given processor accesses words in the neighborhood of words it
accessed recently, while the opposite form of spatial locality increases the rate at which other
processors access these words. Put another way, spatial locality can be useful in multiprocessors
if a larger block size increases the processor locality of shared references. As we will show in
Section 4.3, increasing the block size does not always increase the processor locality.
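
The two opposite effects of block size can be seen on toy traces. The sketch below groups
word addresses into blocks of a chosen size and reports the average length of same-processor
runs per block. The traces are invented for illustration, the run-length convention (every
maximal run counts, including its first reference) is an assumption, and the sketch does not
restrict itself to shared blocks as the measurements in this paper do.

```python
from collections import defaultdict
from itertools import groupby


def mean_run_length(trace, words_per_block):
    """Average length of same-processor runs, computed per block.

    trace: list of (cpu, word_address, op) tuples in trace order.
    words_per_block: block size expressed in words.
    """
    per_block = defaultdict(list)
    for cpu, word, _op in trace:
        per_block[word // words_per_block].append(cpu)
    runs = [len(list(g)) for cpus in per_block.values() for _, g in groupby(cpus)]
    return sum(runs) / len(runs)


# Harmful case: two processors alternately write neighbouring words.
interleaved = []
for _ in range(4):
    interleaved.append((0, 0, "w"))
    interleaved.append((1, 1, "w"))

# Useful case: each processor scans four consecutive words in turn.
scan = [(0, w, "r") for w in range(4)] + [(1, w, "r") for w in range(4)]

for name, tr in (("interleaved", interleaved), ("scan", scan)):
    print(name, mean_run_length(tr, 1), mean_run_length(tr, 4))
```

In the interleaved trace a larger block merges the two words and the runs collapse to length 1,
while in the scanning trace the larger block lengthens each processor's run.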


3  Applications  and  Data  Collection 


Our study is based on trace analysis. The traces are obtained using a multiprocessor extension
of the ATUM tracing scheme [9]. ATUM stands for Address Tracing Using Microcode and works
as follows: During the execution of each instruction, the microcode writes out the memory
references made by the processor to a portion of memory reserved for tracing. In the multiprocessor
extension of ATUM, each access to trace memory is interlocked to enable the microcode in several
processors to write their references to this memory. Thus a trace contains interleaved address
streams of several processors. The traces used for this study were gathered on a 4-CPU VAX
8350 machine running the MACH operating system. Each trace is roughly 3.5 million references
long. In addition to addresses, ATUM records the opcodes, and the virtual-to-physical
translations that occur during translation-lookaside-buffer misses. A location is considered shared
when it is referenced by more than one CPU. Because different processes could access a given
shared location with different virtual addresses, sharing is detected by translating the various
virtual addresses of a shared location to its common physical address.
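
The sharing test described above can be sketched as a translation step followed by a count of
distinct CPUs per physical address. The record layout, the page_table dictionary, and the
default page size are assumptions made for this illustration; they do not describe the actual
ATUM trace format.

```python
from collections import defaultdict


def shared_locations(trace, page_table, page_size=512):
    """Return the set of physical addresses referenced by more than one CPU.

    trace: iterable of (cpu, pid, virtual_address) tuples (hypothetical layout).
    page_table: {(pid, virtual_page): physical_page}, standing in for the
                virtual-to-physical translations recorded with the trace.
    page_size: bytes per page (assumed parameter).
    """
    cpus = defaultdict(set)
    for cpu, pid, vaddr in trace:
        vpage, offset = divmod(vaddr, page_size)
        paddr = page_table[(pid, vpage)] * page_size + offset
        cpus[paddr].add(cpu)
    return {paddr for paddr, who in cpus.items() if len(who) > 1}


# Two processes map the same physical page 7 at different virtual pages.
page_table = {(10, 0): 7, (11, 3): 7, (11, 0): 2}
trace = [
    (0, 10, 0x10),            # process 10 on CPU 0
    (1, 11, 3 * 512 + 0x10),  # process 11 on CPU 1, same physical location
    (1, 11, 0x20),            # process 11 on CPU 1, a different location
]
print(shared_locations(trace, page_table))
```

Comparing virtual addresses directly would miss the first two references as sharing, since
they differ in the virtual address space but resolve to the same physical address.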

The traces used in this paper are obtained from three programs: ParaOPS5, P-Thor, and
LocusRoute. ParaOPS5 [10] is a parallel implementation of a rule-based programming language
called OPS5, which is a widely used language for building expert systems. It exploits
parallelism at a fine granularity and makes extensive use of the shared memory provided by the
architecture. P-Thor is a parallel implementation of a logic simulator done by Larry Soule at
Stanford University. The simulator transforms the task of circuit simulation into a series of node
evaluations, where each node corresponds to a device in the circuit. The parallel implementation
evaluates these nodes in parallel, while handling the dependencies between them. LocusRoute
is a parallel VLSI router written by Jonathan Rose at Stanford [11].

3.1  General  Statistics 

Tables 1 and 2 present some trace statistics relevant to this study. Because the instruction space
is usually read-only, it can be treated specially in memory management, and so the statistics
presented in this paper correspond to data references alone. The columns in Table 1 denote
the total number of user references, user data references, shared user data references, and
shared write references. Instruction and data references are about equal as expected. In ParaOPS5,
P-Thor, and LocusRoute, shared data references comprise roughly 20%, 10%, and 3% of all user
references. The corresponding fractions of shared write references are about 3%, 1%, and 0.2%.


Table 1: Summary of dynamic trace characteristics.

Trace        User References   Data References   Shared References   Shared Writes
             (thousands)       (thousands)       (thousands)         (thousands)
ParaOPS5     2817              1310              [illegible]         77
P-Thor       1527              320 (?)           [illegible]         21
LocusRoute   32[..]            1528              110                 0 (?)

The statistics in Table 2 display the number of unique user blocks, unique shared blocks,
and the unique shared written blocks in the traces.

Our analysis in this paper focuses on user references alone. Except for P-Thor, our applications
did not have a significant amount of process migration related sharing; the few blocks that are
Table 2: Summary of static trace characteristics. Only user data blocks are considered.

Trace        Data Blocks   Shared Data Blocks   Write Shared Blocks
             (thousands)   (thousands)          (thousands)
ParaOPS5     20.3          19.8                 4.0
P-Thor       71.9          1.8                  1.3
LocusRoute   11.6 (?)      0.7                  [illegible]

shared by multiple processors solely due to process migration are not counted in with shared
blocks. Results on sharing in the operating system, and sharing owing to process migration, can
be found in [4].

4  Results  and  Analyses 

This section first analyzes temporal locality in the traces. We then evaluate the processor locality
in the traces and the impact of block size on this parameter. We evaluated three different cache
coherence schemes by the amount of traffic they generate for various block sizes. This paper
summarizes our findings and uses processor locality as a means of gaining insight into their
behavior. Unless stated otherwise, we assume infinite caches and 4-byte blocks.

4.1  Temporal  Locality 

This section deals with dynamic memory access patterns and characterizes the temporal locality
of cpu-shared user data references. We present the median of the distribution of time intervals
between clinging and pinging references in Table 3 to demonstrate the temporal locality of data
references. The average interval of time between accesses to the same shared block tends to be
large because even one reference with a very large interval (or an outlier) can skew the average
towards large values. Such outliers are not important for two reasons. First, in practical finite
sized caches the much shorter cache lifetime of blocks would preclude such large values. Second,
the large values in our applications are chiefly due to clinging references that occur when the
process resumes execution on the same processor after being switched out. Therefore, in the
context of time intervals, a more interesting number is the median, or the time interval over
which half the clinging or pinging references occur.
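
The choice of the median over the average can be illustrated with a short sketch. It measures,
for each clinging or pinging reference, the time since the previous reference to the same block,
which is an assumed reading of "time interval between references"; the trace layout and the
sample intervals are invented for illustration.

```python
from statistics import mean, median


def cling_ping_intervals(trace):
    """Split inter-reference intervals into clinging and pinging ones.

    trace: list of (cpu, block) in trace order; time is the 1-based index.
    Returns (cling_intervals, ping_intervals).
    """
    last = {}                 # block -> (cpu, time) of the previous reference
    clings, pings = [], []
    for t, (cpu, block) in enumerate(trace, start=1):
        if block in last:
            prev_cpu, prev_t = last[block]
            (clings if prev_cpu == cpu else pings).append(t - prev_t)
        last[block] = (cpu, t)
    return clings, pings


trace = [(0, "A"), (0, "A"), (1, "A"), (1, "B"), (1, "A")]
print(cling_ping_intervals(trace))

# One outlier, such as a process resuming after being switched out,
# dominates the average but leaves the median almost untouched.
intervals = [2, 3, 2, 4, 5000]
print(mean(intervals), median(intervals))
```

With the sample intervals the average is above 1000 time units while the median is 3, which is
why the tables report medians.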

Table 3: Temporal locality characteristics. Only user data blocks are considered. Block size is 1 word (4
bytes). Numbers denote the median of the frequency distribution of time intervals between three events:
clinging references, pinging references, and pinging references to a dirty block.

Trace        Cling References   Ping References   Pings to Dirty Blks
ParaOPS5     23                 10                363
P-Thor       25                 7                 1779
LocusRoute   28188              13869             19711

In ParaOPS5 and P-Thor over 50% of the intervals between clinging references are 25 time
units or less. Not surprisingly, these numbers show that blocks are re-referenced at small intervals
of time. This is simply a reaffirmation of the belief that memory references display a high
temporal locality, and is the precise reason caching is successful.

For detailed frequency distribution graphs, see [4].

LocusRoute has a much larger interval. In LocusRoute, wires are selected at random and are
routed using cost values from a shared matrix. Because a wire might be selected by
any processor at random, there is no significant temporal locality in referencing the elements in the
cost matrix. An algorithm with better temporal locality might favor routing wires in a given
neighborhood, rather than choosing a wire at random, to increase the probability that a given word
is referenced again soon. Such a choice will benefit spatial locality also.

These temporal locality results are compared with those for pinging references, or for a
reference to a block by a processor followed by a reference from another processor. The time
intervals here are interestingly lower than for clinging references, which says that references to
blocks by different processors are usually at least as finely interleaved as references by
the same processor. Doubtlessly, the cause of the high temporal locality of pinging references
is that our applications exploit parallelism at a fine granularity, and the use of spin locks for
synchronization.

In addition to a tall peak at a low time interval, our frequency
distribution for pings showed a small second peak at 2[illegible] time units in P-Thor owing to the
process migration to another processor following a context switch. If the level of process migration
is high, the peak at a large time interval can become much taller, which falsely suggests that
process migration lowers the temporal locality of shared references. In reality, process migration
simply makes a large fraction of the logically private blocks appear shared, and it is references
to these migrated blocks alone that cause the tall second peak.

The previous results did not distinguish between read and write references. Making this
distinction is necessary because in many high performance multiprocessor architectures, writes
and pinging references to dirty blocks cause bus traffic because the new value of the dirty block
must be somehow transmitted to the requesting processor. The time interval between pinging
references to a dirty block for the three applications is far greater than the corresponding time
between all pinging references. The high frequency of pinging references at low time intervals is
therefore attributable to read references. A possible cause is the test-and-test&set synchronization
scheme, where one might expect multiple reads from several processors, but less frequent
writes. The low temporal locality in pinging references to dirty blocks encourages us to believe
that for large time periods blocks can be considered as private and no traffic need be generated
in maintaining consistent caches. One conclusion of this observation is that cache management
schemes must support efficient read sharing of blocks.

4.2 Processor Locality

As caches grow bigger, blocks are expected to stay in the cache for long periods of time. Then,
a better characterization uses the notion of processor locality. Our discussion here addresses
processor locality in two ways. The first uses the number of clinging references to a block
(r* w* r/w_p), and the second the number of clinging references to a block, given that at least
one of the references was a write (r* w+ r/w_p).
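
As an illustration only, the two measures can be computed from a reference trace roughly as
sketched below. The sketch assumes that a clinging reference is one made by the same processor
that last referenced the block, and a pinging reference is one made by a different processor.
The trace format, the function names, and the choice to count the first reference of a run in
its length are assumptions of this sketch, not a description of the tools used in this study.

```python
def clinging_runs(trace):
    """Split a trace into per-block runs of references by a single processor.

    trace: iterable of (processor_id, block_id, op), op in {"r", "w"},
    given in global time order (hypothetical format).
    Returns a list of (length, writes) tuples, one for each run that was
    ended by a pinging reference. Runs still open at the end are dropped.
    """
    current = {}  # block_id -> [processor, run length, writes in run]
    runs = []
    for proc, block, op in trace:
        state = current.get(block)
        if state is not None and state[0] == proc:
            # Clinging reference: same processor as the previous reference.
            state[1] += 1
            state[2] += int(op == "w")
        else:
            # Pinging reference: it closes the previous run, if any.
            if state is not None:
                runs.append((state[1], state[2]))
            current[block] = [proc, 1, int(op == "w")]
    return runs


def average_run_length(runs, require_write=False):
    """Mean run length; with require_write, only runs containing a write."""
    lengths = [n for n, writes in runs if writes > 0 or not require_write]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Calling `average_run_length(runs)` corresponds to the first measure, and
`average_run_length(runs, require_write=True)` to the second.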

Figure 2 shows the frequency histogram of the number of clinging references to a block, given
at least one reference was a write. Due to the wide range of the number of references, the bins
on the X axis increase in powers of two; a bar at x with height y in the frequency histogram plot
implies y sequences of clinging references of length l, such that x ≤ l < 2x. Here, we will use
averages because the average is more indicative of processor locality than the median; outliers
represent a large number of references, and must be weighted accordingly.
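
The binning rule can be sketched as follows, assuming that the bin labeled x collects run
lengths from x up to, but not including, 2x. The function names are hypothetical, and the
helper that returns both the mean and the median is included only to show why a few long
runs pull the average well above the median.

```python
import statistics
from collections import Counter


def power_of_two_histogram(lengths):
    """Count run lengths in bins [1,2), [2,4), [4,8), ... keyed by the lower edge."""
    hist = Counter()
    for n in lengths:
        if n < 1:
            raise ValueError("run lengths must be positive integers")
        lower_edge = 1 << (n.bit_length() - 1)  # largest power of two <= n
        hist[lower_edge] += 1
    return dict(sorted(hist.items()))


def mean_and_median(lengths):
    """Return (mean, median) of the run lengths."""
    return statistics.mean(lengths), statistics.median(lengths)
```

For a made-up sample such as `[1, 1, 1, 1, 300]` the median is 1 while the mean is about 61,
because the single long run accounts for most of the references.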

Several observations can be made from Figure 2. First, the average number of clinging
references to written blocks is [...] for ParaOPS5, [...] for P-Thor, and [...] for LocusRoute.
Write references are much fewer than reads and contribute 1.6, 1.7, and 1.2 respectively to these
averages. The write reference sequences correspond to the form of processor locality denoted
as [...].

We found a significantly lower processor locality in the distributions for clinging references
when we relaxed the requirement that each sequence have at least one write [...]. For example,
for P-Thor, there are about 200,000 pinging references to a block referenced only once by the
previous processor. The correspondingly low average of [...] for P-Thor indicates that interleaved
references by different processors are as frequent as clinging references, implying low processor
locality. (The averages for ParaOPS5 and LocusRoute were [...] and 2.5 respectively.) A cache
coherency scheme that allowed only one cached copy of any block [8] performed abysmally for
this very reason. Another important observation is that the total number of pinging references
to dirty blocks is approximately an order of magnitude lower than that of all pinging references,
which lowers the overall rate of cache consistency related transactions.

One of the chief differences between some of the cache consistency schemes is the way they
treat write references. One set of schemes, e.g., Dragon [12] or Firefly [13], allow caches
to hold valid copies of blocks that are being written into by others, and receive updates of the
values on writes. Another set of schemes allow only one copy of a written block (e.g., Berkeley
Ownership [14], or various flavors of directory schemes [8]). The performance of update versus
invalidate is predicated on the locality of references to write-shared blocks. As noted earlier, the
average number of writes to a block before a pinging reference is small although not unity (1.7
for P-Thor), implying that neither method will overwhelmingly outperform the other. Our
results in the next section show that the invalidate and update schemes perform similarly for
one-word block sizes and bear out this intuition.

There are several possible reasons for the low value of clinging write references. We expect
a low value for write references to spinlocks. We also expect this value to be low for migratory
shared objects [7] which move from one processor to another, with each processor making some
modifications to the object. Also, mostly-read-only objects are written once, and then numerous
pinging read references are made by other processors.

We also studied the distribution of the number of clinging write references. A surprising
observation was that a significant fraction of clinging sequences had exactly one write. The larger
average is due to a small number of clinging write sequences with several tens of writes. This
dichotomous nature of clinging sequences suggests that competitive cache coherence schemes [15]
that can resort to invalidations when the number of write updates crosses a threshold might
be the right scheme to use. We also noticed that it was usually the synchronization objects
that resulted in a clinging sequence with exactly one write. So another possibility would be
to use an update-based protocol for synchronization objects, while using an invalidation-based
coherence protocol for all other data objects. In an environment where processes can migrate,
yet another scheme might use invalidations for private data objects spuriously shared due to
process migration and use updates for other blocks.
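
The threshold idea can be illustrated with a small sketch. It is a hypothetical policy written
for this illustration, not the scheme of [15]: each remote copy counts the updates it has
received since its holder last used it, and once that count exceeds a threshold the copy is
invalidated instead of updated. The class name, method names, and the exact threshold rule
are all assumptions.

```python
class CompetitiveCopy:
    """One cached copy of a block under a hypothetical update-then-invalidate policy."""

    def __init__(self, threshold):
        self.threshold = threshold      # updates tolerated between local uses
        self.valid = True
        self.updates_since_use = 0

    def local_access(self):
        """The holder references its copy. Returns True on a hit, False on a miss."""
        hit = self.valid
        self.valid = True               # a miss refetches the block
        self.updates_since_use = 0
        return hit

    def remote_write(self):
        """Another processor writes the block.

        Returns "update" if the new value is sent to this copy, "invalidate"
        if the copy is dropped, or "none" if it was already invalid.
        """
        if not self.valid:
            return "none"
        self.updates_since_use += 1
        if self.updates_since_use > self.threshold:
            self.valid = False
            return "invalidate"
        return "update"
```

With a small threshold, a long sequence of clinging writes costs a few updates followed by one
invalidation, while sequences with a single write behave as in a pure update protocol.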

In summary, we saw that the processor locality of shared references is moderate, with roughly
2 writes and 4 reads on average to write-shared objects before a pinging reference. Given the
moderate processor locality of shared data, invalidating schemes such as the Berkeley Ownership
protocol or directory schemes, and the updating protocols such as the Dragon and Firefly
schemes, are expected to have similar performance. We verified this using simulation in
Section 4.4.

Figure 2: Distribution of the number of references to a block before a pinging reference to the
block, given that at least one reference was a write. Only shared data references of user are included.

4.3  Spatial  Locality 

We now examine the effects of spatial locality on the performance of cache coherence schemes.
Figures 3 and 4 plot processor locality histograms for the three applications for block sizes of
16 and 64 bytes. The averages for the three block sizes are shown in Table 1.

Table 1: Spatial locality and the impact of block size. Only user data blocks are considered. Numbers
denote the average of the number of clinging references to a block, at least one of which was a write.
[Table entries not recovered except the final value, 117.2.]

Increasing the block size impacts the various applications differently. LocusRoute shows a
substantial improvement in processor locality ([...] to 117.2) as block size is increased from 4 to
64 bytes. The reason for the substantial improvement for LocusRoute is that it has a central
data structure, called the cost array, accessed very frequently and in a regular fashion, thus
resulting in high spatial locality of references.* In comparison to LocusRoute, both P-Thor and
ParaOPS5 show improvements of a much smaller magnitude, and in fact, the processor locality
measure decreases slightly for ParaOPS5 as we go from 16 to 64 byte blocks.

Why does block size impact processor locality so differently for various shared applications?
As the block size is increased the potential for references to adjacent words increases and two
opposing forces come into play. If the probability that a given processor accesses a word in the
vicinity of a word it accessed before increases, then the processor locality is likely to improve.
Contrarily, a larger block size increases the probability of unrelated shared words residing in
the same block, and a write to one word can cause a ping to the entire block currently being
accessed by another processor. Clearly, the applications display differing degrees of both effects.
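
The two opposing forces can be reproduced on toy traces with the sketch below. It assumes a
trace of (processor, byte address) pairs, maps each address to a block by integer division,
and reports the lengths of single-processor runs that were ended by a reference from another
processor. It ignores the distinction between reads and writes, and the function name and
trace format are hypothetical.

```python
def run_lengths(trace, block_size):
    """Lengths of per-block single-processor runs that were ended by a ping.

    trace: iterable of (processor_id, byte_address) in time order.
    block_size: block size in bytes.
    """
    owner = {}     # block -> processor that made the latest reference
    length = {}    # block -> length of the run currently in progress
    finished = []
    for proc, addr in trace:
        block = addr // block_size
        if owner.get(block) == proc:
            length[block] += 1                  # clinging reference
        else:
            if block in owner:
                finished.append(length[block])  # ping closes the old run
            owner[block] = proc
            length[block] = 1
    return finished


# Positive force: each processor sweeps four adjacent words in turn.
sweep = [(0, 0), (0, 4), (0, 8), (0, 12), (1, 0), (1, 4), (1, 8), (1, 12)]

# Negative force: two processors repeatedly use different, adjacent words.
interleaved = [(0, 0), (1, 4)] * 3
```

With 4-byte blocks the sweep produces four runs of length 1, and with 16-byte blocks a single
run of length 4. The interleaved trace produces no pings at all with 4-byte blocks, but five
runs of length 1 once both words fall into the same 16-byte block.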

Let us look at the issue of same processor versus different processor accesses of a block more
concretely. Examine the processor locality distributions for ParaOPS5 when block size is 16
bytes and when it is 64 bytes (see top of Figures 3 and 4). We see a significantly larger number
of occurrences with 8-16, 16-32, 32-64, and 64-128 clings before a ping as we move from 16 to 64
byte blocks. This increase is due to the spatial locality in the references of a single processor, the
positive force. However, we also see a significant increase in the number of occurrences where
there are only 1-2 clings before a ping as we move from 16 to 64 byte blocks. This is a result of
the interference effect discussed above, and overall it nullifies the advantage of the large block
size.

* Note that the height of the distribution becomes smaller as block sizes are increased because even small values
at the tail end of the distribution correspond to a large number of references. For example, in LocusRoute, there
are [...] sequences of length between 256 and 512, which account for several thousand references.

Figure 3: Distribution of the number of references to a block before a pinging reference to the same
block, given that at least one reference was a write. Only shared data references of user are included.
Block size is 16 bytes.

Figure 4: Distribution of the number of references to a block before a pinging reference to the same
block, given that at least one reference was a write. Only shared data references of user are included.
Block size is 64 bytes.

4.4 Cache Consistency Implications

Processor locality impacts the performance of cache coherence schemes. We examined the
performance of several cache coherence schemes through simulation with the ATUM traces for
various block sizes and used our notions of processor locality to gain insight into their behavior.
Our findings are summarized here. We also discuss some of the limitations of our definitions of
processor locality, and suggest modified definitions for use in specific applications.

Of the several cache coherence schemes proposed in the literature (e.g., [16, 14, 12, 17, 13]), we
consider a representative each from the write-through with invalidate, write-back with invalidate,
and write-back with update classes of cache coherence schemes assuming a shared bus as the
communications medium.* To help explain the various phenomena we observed, we use the data
presented in earlier sections. As before we assume infinite caches, and unless otherwise stated,
block size is one word (or four bytes).

The write-through with invalidate scheme (WTI) is commonly used in low-end commercial
multiprocessors. In this scheme, every write from a processor accesses the bus both to update
main memory and to invalidate that location in other caches.

Examples of write-back with invalidate schemes include Goodman's write-once [16], Rudolph
and Segall's scheme [17], Berkeley Ownership [14], and the directory scheme [20]. We consider
write-once (denoted WBI) as the second scheme in this paper. In this scheme, the first write to
a location uses the bus to update main memory and to invalidate that location in other caches.
Subsequent writes to that location by the same processor do not result in any bus traffic, as
that location is now owned locally.

Write-back with update schemes include Dragon [12] and Firefly [13]. We use Dragon as the
third scheme, and denote it WBU. In the Dragon scheme, all writes to a shared location (a
location present in multiple caches) result in a bus access to update the value of that location
in other caches. For non-shared locations, the cache acts like a regular uniprocessor write-back
cache.

We evaluated the performance of the above three cache coherence schemes in terms of the
bus transactions generated. A bus transaction is generated on block transfers due to misses,
invalidations, or updates. Because of our interest in characteristics of shared references, we
only include cpu-shared user data references for ParaOPS5, P-Thor, and LocusRoute. Because
caches are infinite, a data item brought into the cache remains there until invalidated.
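
To make the accounting concrete, the following sketch counts bus transactions for the three
schemes on a small trace. It is a deliberately simplified model written for illustration and
is not the simulator used for the results reported here. It assumes infinite caches, one-word
blocks, one transaction per miss, invalidation, or update, and that a writer in WTI keeps a
valid copy of the word it writes; the function name and trace format are hypothetical.

```python
def count_bus_transactions(trace, scheme):
    """Count bus transactions for scheme "WTI", "WBI", or "WBU".

    trace: iterable of (processor_id, location, op), op in {"r", "w"}.
    """
    if scheme not in ("WTI", "WBI", "WBU"):
        raise ValueError("unknown scheme: " + scheme)
    holders = {}   # location -> set of processors holding a valid copy
    owner = {}     # location -> processor that owns it exclusively (WBI)
    transactions = 0
    for proc, loc, op in trace:
        copies = holders.setdefault(loc, set())
        if op == "r":
            if proc not in copies:        # read miss: one block transfer
                transactions += 1
                copies.add(proc)
                owner.pop(loc, None)      # the location is no longer exclusive
            continue
        if scheme == "WTI":
            transactions += 1             # every write uses the bus
            holders[loc] = {proc}         # other copies are invalidated
        elif scheme == "WBI":
            if owner.get(loc) != proc:    # first write of a sequence
                transactions += 1
                holders[loc] = {proc}
                owner[loc] = proc
        else:                             # WBU
            if proc not in copies:        # write miss: fetch the block
                transactions += 1
                copies.add(proc)
            if len(copies) > 1:           # shared: send an update
                transactions += 1
    return transactions


# Processor 0 writes x three times, processor 1 reads it, processor 0 writes twice more.
example = [(0, "x", "w")] * 3 + [(1, "x", "r")] + [(0, "x", "w")] * 2
```

On this example the model counts 6 transactions for WTI, 3 for WBI, and 4 for WBU: WTI pays
for every write, WBI pays once per clinging write sequence plus the read miss, and WBU pays
only for the misses and for writes made after the location became shared.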

Before we discuss our results, we examine how we might choose an appropriate definition of
processor locality for a given application. Recall the three variations of processor locality in
Section 4.2. The first form simply counts the number of clinging references to a block (r* w* r/w_p).
In other words, we use the average number of repeat references by a processor to a given block
of data. The second form counts the number of clinging references to a block for those runs that
included at least one write reference (r* w+ r/w_p). Figures 2 through 4 plotted distributions
using this second form. The third form counts just the number of writes in sequences of the
second form ([...]). Eggers and Katz define and use the same notion in their evaluation of
cache coherence schemes in [5].

The first form is useful in analyzing cache coherence schemes that allow only one cached copy
of a block (e.g., the Dir1NB scheme [8]). The second form is useful in examining ownership
based protocols, where a writer must first become the sole owner of a block before proceeding
with the write. The third form is necessary to distinguish between invalidating and updating
(or write-through) protocols.

* While a detailed analysis of the numerous cache consistency schemes proposed in the literature would be
interesting, it is beyond the scope of this paper. Instead, see the simulation study of several cache coherence
schemes by Archibald and Baer [18], and more recently, the simulation study using real address traces by Eggers
and Katz [19].

We first compared the performance of WTI, WBI, and WBU for 4-byte blocks using form
three of processor locality. Comparing the number of transactions, we saw that the WTI scheme
was worse than both WBI and WBU. WTI loses to WBI because of the processor locality
displayed by write references. While every write generates bus traffic in WTI, clinging write
references do not cause bus traffic in WBI. In fact, recall from Section 4.2 that on average there
were 1.6, 1.7, and 1.2 writes in a sequence of clings before a ping for ParaOPS5, P-Thor, and
LocusRoute respectively. Consistent with these numbers, we verified that the savings of WBI
over WTI are greatest for P-Thor, next greatest for ParaOPS5, and least for LocusRoute. For
example, WBI in P-Thor saves 49% of the bus transactions over WTI, ParaOPS5 saves 31%, and
LocusRoute saves 11%.
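
A back-of-the-envelope estimate shows why the ordering follows from the averages. The sketch
below assumes that WTI pays one bus transaction per write while WBI pays one per clinging
write sequence, so that with an average of k writes per sequence WBI avoids a fraction
1 - 1/k of the write-generated transactions. This simplification ignores read misses and
everything else that contributes to the measured totals; the function names are hypothetical.

```python
def write_traffic_saved(avg_clinging_writes):
    """Estimated fraction of write-generated transactions that WBI avoids."""
    if avg_clinging_writes < 1:
        raise ValueError("a sequence with a write contains at least one write")
    return 1.0 - 1.0 / avg_clinging_writes


# Average number of writes per clinging sequence, as quoted in the text.
AVERAGE_WRITES = {"ParaOPS5": 1.6, "P-Thor": 1.7, "LocusRoute": 1.2}


def rank_by_savings(averages):
    """Application names ordered from largest to smallest estimated saving."""
    return sorted(averages, key=lambda name: write_traffic_saved(averages[name]),
                  reverse=True)
```

The estimate gives about 37.5% for ParaOPS5, 41% for P-Thor, and 17% for LocusRoute. The
absolute values differ from the measured savings, as expected from such a crude model, but
the ordering of the three applications is the same.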

Comparing WTI and WBU, both schemes generate an update transaction for every write
to a shared location. However, WBU saves about 25% of the updates because before the point that
a location becomes shared (a second processor requests it), only the first read or write produces a
bus transaction. WBU also has fewer block transfers because, unlike WTI, it never invalidates
a location from a cache.

Let us now compare WBI and WBU. WBI, in general, will be superior to WBU if there
is a large number of clinging writes to an object before a ping. This is because WBI does
not produce bus traffic after the first write in a sequence of clinging writes. Again, recall from
Section 4.2 that on average there are 1.6, 1.7, and 1.2 clinging writes for ParaOPS5, P-Thor,
and LocusRoute respectively. Thus WBI has the greatest chance to win over WBU for P-Thor,
next for ParaOPS5, and least for LocusRoute, which is borne out by simulations. WBI wins
over WBU by 28% for P-Thor, by 3% for ParaOPS5, and loses by 21% for LocusRoute.

Dividing the total number of bus transactions generated by all three programs for the WBI
scheme (161.6K) by the total number of references that resulted in these transactions (1168.7K),
we see that there are approximately 0.138 bus transactions generated per reference. This number
appears quite large given infinite caches, and there are two reasons for this. First, this data
represents only cpu-shared user data references, which show poor processor locality as in Figure 2,
or equivalently, which display a high temporal locality of pinging references. Consequently they
do not benefit much from the read-sharing allowed by the WBI scheme. If one includes both user
and OS references, and both data and instructions, then the number of transactions per
reference falls to 0.031, which is much better. This reduction is primarily due to the large number of
read-shared references generated by instruction fetches. When the block size is increased from
4 to 16 bytes, the number of transactions per reference further drops down to 0.016, primarily
due to the high spatial locality of instruction fetch references.

We then examined the bus transactions generated by WBI as the block size is increased to
study the spatial locality characteristics of cpu-shared user data references. For this analysis the
second form of processor locality is relevant because once a block is read, a transaction takes
place only on a pinging reference: on a pinging read the block must be written back to memory,
while on a write the block must be invalidated.

We observed that the measure of processor locality using the second form correctly predicts
the trends in ParaOPS5 and LocusRoute. For example, the transaction rate in ParaOPS5
decreases when the block size is changed from 4 to 16 bytes, and the number of transactions
increases when the block size is further changed to 64 bytes. A corresponding increasing trend
is observed in the second form of the processor locality parameter (see Figures 2 through 4).

A different trend is observed in LocusRoute as the block size is increased. The transaction
rate decreases as we go from 4 to 16 to 64 bytes; a corresponding improving trend is displayed
by the processor locality parameter for LocusRoute as the block size is increased.

The trends in P-Thor, however, did not match completely. A possible reason for the
disagreement we observed is that the second form of processor locality as defined by us corresponds
most closely to a protocol that invalidates a currently dirty copy of a block in a cache on a
pinging read rather than just performing the writeback and making it clean. If a more accurate
processor locality metric for analyzing performance of ownership protocols that clean rather
than invalidate is desired, one can measure the average length of sequences of references to a
block of data by a given processor, terminating the sequences only on pinging writes. This form
of processor locality is denoted r* w_p. The important observation is that the notion of pings
and clings makes it possible to customize the processor locality definition to suit a particular
application.
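
One possible reading of this modified definition is sketched below. For every block it keeps a
separate count of references per processor, and a processor's sequence is closed only when a
different processor writes the block; reads by other processors leave it open. The trace
format and function name are hypothetical, and other readings of the definition are possible.

```python
def runs_ended_by_pinging_writes(trace):
    """Lengths of per-processor reference sequences closed by another processor's write.

    trace: iterable of (processor_id, block_id, op), op in {"r", "w"}.
    """
    counts = {}    # (block, processor) -> references since the last pinging write
    finished = []
    for proc, block, op in trace:
        if op == "w":
            # A write pings every other processor's open sequence on this block.
            for (b, p), n in list(counts.items()):
                if b == block and p != proc and n > 0:
                    finished.append(n)
                    counts[(b, p)] = 0
        counts[(block, proc)] = counts.get((block, proc), 0) + 1
    return finished
```

Under this reading, two processors that only read a block never terminate each other's
sequences, which matches the behavior of a protocol that leaves clean copies in place.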

5  Summary  and  Conclusions 

We have characterized locality in memory reference patterns of shared-memory multiprocessors.
Our data is based on traces obtained for three applications from a 4-processor VAX 8350 using
the ATUM address tracing technique. About one-fifth of the references in the traces are to
shared objects.

Shared  references  display  a  significant  amount  of  temporal  locality,  but  only  a  moderate 
amount  of  processor  locality  for  both  read  and  write  references.  For  example,  the  average 
number  of  reads  and  writes  to  a  write-shared  block  before  a  remote  reference  (a  ping,  which 
may  possibly  invalidate  the  data)  are  4  and  2  respectively.  Nevertheless,  caching  shared  data 
is  still  highly  useful  because  of  the  significant  amount  of  read  sharing.  Although  the  average 
number  of  writes  to  a  block  before  a  remote  reference  is  just  2,  we  observed  a  high  variance 
in  the  length  of  write  sequences.  We  believe  that  the  use  of  hybrid  updating  and  invalidating 
schemes,  such  as  updating  for  synchronization  objects  and  invalidating  for  others,  or  a  dynamic 
competitive  cache  management  strategy  will  prove  useful  in  such  environments. 

The  locality  characterization  of  the  shared-memory  reference  patterns  also  yields  insight 
on  how  various  cache  consistency  schemes  will  perform.  We  analyzed  three  classes  of  cache 
consistency  schemes — write-through  with  invalidate  (WTI),  write-back  with  invalidate  (WBI), 
and  write-back  with  update  (WBU).  For  shared  data  references,  WTI  performs  worse  than  both 
WBI and WBU as it uses the bus on every write. Comparing WBI and WBU, the former seems
to have an edge for 4-byte blocks, while WBU does better for 16-byte and 64-byte blocks. The
processor locality parameter shows that blocks larger than 16 bytes in P-Thor and ParaOPS5
cause  a  degradation  in  processor  locality,  and  thus  the  total  bus  traffic  increases  rapidly  with 
increasing  block  size.  The  WBU  scheme  is  less  influenced  by  the  block  size  than  WBI  and  WTI 
because  it  always  uses  single  word  updates.  Consequently,  for  large  block  sizes,  WBU  performs 
better  than  WBI  and  WTI  for  all  three  programs. 
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